Experiments in Authorship-Link Ranking and Complete Author Clustering

Authors

  • Valentin Zmiycharov
  • Dimitar Alexandrov
  • Hristo Georgiev
  • Yasen Kiprov
  • Georgi Georgiev
  • Ivan Koychev
  • Preslav Nakov
Abstract

This paper presents the approach we developed for the Authorship-Link Ranking and Complete Author Clustering task at the PAN 2016 competition. Given a document collection, the task is to group documents written by the same author, so that each cluster corresponds to a different author. The task can also be viewed as establishing authorship links between documents. We combine classification with agglomerative clustering, using a rich set of features such as average sentence length, function-word ratio, type-token ratio, and part-of-speech tags.
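As a rough illustration of the kind of pipeline the abstract describes, the sketch below computes three of the named features and groups documents with a plain average-linkage agglomerative clustering. It is not the authors' system: the classification step and part-of-speech features are omitted, the features are left unscaled, and the names `FUNCTION_WORDS`, `stylometric_features`, `agglomerative_clusters`, and the `threshold` parameter are assumptions introduced only for this example.

```python
import math
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or",
                  "but", "to", "is", "was", "it", "that"}


def stylometric_features(text):
    """Return (average sentence length in tokens, function-word ratio, type-token ratio)."""
    # Naive sentence split on terminal punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens or not sentences:
        return (0.0, 0.0, 0.0)
    avg_sentence_len = len(tokens) / len(sentences)
    function_ratio = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return (avg_sentence_len, function_ratio, type_token_ratio)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering over feature vectors.

    Merging stops once the two closest clusters are farther apart than
    `threshold`, so the number of clusters (authors) is not fixed in advance.
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters (x < y).
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

In use, each document would be mapped to a vector with `stylometric_features` and the resulting list passed to `agglomerative_clusters`. Because average sentence length has a much larger range than the two ratios, a realistic setup would normalize the features before measuring distances.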

Similar resources

Author Clustering using Hierarchical Clustering Analysis

This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two feature representation methods, the log-entropy model and tf-idf, while tuning minimum frequency threshold values to reduce the dimensionality. Our system was ranke...


Author Clustering based on Compression-based Dissimilarity Scores

The PAN 2017 Author Clustering task examines two application scenarios: complete author clustering and authorship-link ranking. In the first scenario, one must identify the number (k) of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish auth...


Author Identification Based on a Hybrid Feature Set Using Machine Learning and Clustering Techniques

Author identification of a document can be performed using computational or statistical methods. In this paper, we try to identify the author of two ancient Arabic religious books dating from the 6th century: the Holy Quran and the Hadith. Authorship identification consists of identifying the author of an anonymous document by using techniques from Natural Language Processing (NLP) and Arti...


Authoritative Re-Ranking in Fusing Authorship-Based Subcollection Search Results

We examine the use of authorship information to divide IR test collections into subcollections and apply techniques from the field of distributed information retrieval to enhance the baseline search results. We determine the expertise of each author, based on the content of their documents, and use this knowledge to construct rankings of the different author subcollections for each query. We go...


On co-authorship for author disambiguation

Author name disambiguation deals with clustering the same-name authors into different individuals. To attack the problem, many studies have employed a variety of disambiguation features such as coauthors, titles of papers/publications, topics of articles, emails/affiliations, etc. Among these, co-authorship is the most easily accessible and influential, since inter-person acquaintances represen...


Publication date: 2016